Tree methods

In this lab, we’ll look at how to implement a range of tree methods, including basic classification and regression trees, random forests, and boosted regression trees. We will also look at how to predict for new data and how to automate hyperparameter tuning.

We’ll illustrate these methods by using them to build a species distribution model for the piñon pine (Pinus edulis) in the western United States. We’ll also quickly demonstrate how to build a regression tree for the California housing dataset. You will need the following datasets from Canvas, so download these to your datafiles folder (extract any zip files). Make a new folder for today’s class called lab08.

  • Pinus_edulis.csv
  • ne_50m_admin_0_countries.zip
  • current.env.RData
  • future.env.RData

Python users

If you are using Python for today’s lab, you’ll need to download the Jupyter notebook for this lab (GEOG_5160_6160_lab08.ipynb). Make a new folder for the lab (lab08) and move the notebook to this. Now open a new terminal (in Windows go to the [Start Menu] > [Anaconda (64-bit)] > [Anaconda prompt]).

Once opened, change directory to the folder you just made and activate your conda environment:

conda activate geog5160

Start the Jupyter Notebook server:

jupyter notebook

And open the notebook for today’s class.

RStudio users

Start RStudio and set the working directory to this directory (This can be changed by going to the [Session] menu and selecting [Set working directory] -> [Choose directory…], then browsing to the folder).

You will need to make sure the following libraries are installed on your computer:

  • sf: functions for working with spatial data
  • raster: functions for working with gridded spatial data
  • dismo: utility functions for species distribution models
  • tree: simple classification and regression trees
  • rpart.plot: better CART graphics
  • vip: better variable importance plots
  • pdp: partial dependency plots
  • RColorBrewer: color palettes for mapping

Data processing

Before doing anything else, run the following command and make sure that you can see the files you downloaded.

list.files()

Next load the libraries you will need for the lab. You should at this stage have most of these already installed. Add anything that is not installed using the install.packages() function.

library(dplyr)
library(raster)
library(sf) 
library(dismo)
library(tree)
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(vip)
library(pdp)
library(RColorBrewer)

Getting data 1: location data

California housing data

We’ll start by reading in and plotting the California housing dataset. The following code reads the file and cleans up the data. We then create two data sets to show how the algorithm works, one with only the coordinates as features, and the other with the usual set of house characteristics as features. The only new code here divides the house values by 1000 to help in displaying some of the output. Refer back to the previous labs for explanations of this code:

housing <- read.csv("../datafiles/housing.csv")
housing <- housing %>%
  dplyr::filter(!is.na(total_bedrooms))
housing <- housing %>%
  mutate(avg_rooms = total_rooms / households,
         bedroom_ratio = total_bedrooms / total_rooms,
         median_house_value = median_house_value / 1000)
housing2 <- housing %>%
  dplyr::select(longitude, latitude, median_house_value)
housing3 <- housing %>%
  dplyr::select(-longitude, -latitude, -total_rooms, -total_bedrooms, -ocean_proximity)

Let’s make a quick plot (we’ll use this later to illustrate the regression tree)

price.deciles <- quantile(housing2$median_house_value, 0:10/10)
cut.prices <- cut(housing2$median_house_value,price.deciles,include.lowest=TRUE)
plot(housing2$longitude,housing2$latitude,col=grey(10:2/11)[cut.prices],pch=20,
     xlab="Longitude",ylab="Latitude", asp = 1)

Pinus edulis data

Next, we’ll read the known locations of Pinus edulis trees from the file Pinus_edulis.csv, and plot to check the data

pe <- read.csv("../datafiles/Pinus_edulis.csv")
pe <- st_as_sf(pe, coords = c("longitude", "latitude"))
borders <- st_read("../datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp")
## Reading layer `ne_50m_admin_0_countries' from data source `/Users/u0784726/Dropbox/Data/devtools/geog5160/datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp' using driver `ESRI Shapefile'
## Simple feature collection with 241 features and 94 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -180 ymin: -89.99893 xmax: 180 ymax: 83.59961
## geographic CRS: WGS 84
plot(st_geometry(pe), xlim = c(-120,-80), ylim = c(25,55), 
     pch = 21, bg = "darkorange", axes=TRUE)
plot(st_geometry(borders), add = TRUE)

Getting online data [optional]

If you need to get your own species data, the dismo package includes a function gbif() that allows you to download records directly from GBIF. To demonstrate this, we’ll use it here to get occurrence records of the timber rattlesnake Crotalus horridus. This function allows selection by Latin name, so here we specify the genus and species names separately (this will take a few seconds to retrieve all the records):

rattler <- gbif('crotalus','horridus')

The returned object has much more information about the occurrence records, including metadata about the original study that supplied the data. Note that all records are returned irrespective of whether or not they have associated geolocations, so we subset only the records that have coordinates (the function does have an argument to exclude records with no longitude or latitude).

rattler <- subset(rattler, !is.na(rattler$lon))
rattler <- st_as_sf(rattler, coords = c("lon", "lat"))

Finally we plot the records using the same code as before

plot(st_geometry(rattler), pch = 21, bg = "green", axes = TRUE)
plot(st_geometry(borders), add = TRUE)

Note that there are some odd locations, including an observation in Turkey. Before using these data in modeling, we would want to verify and potentially remove these records.
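As an illustrative clean-up, records can be filtered by their coordinates. This is only a sketch: the bounding-box limits below are assumptions chosen for demonstration, not a vetted range map for this species.

```r
## Illustrative clean-up: keep only records inside a rough bounding
## box for eastern North America (limits chosen for demonstration)
crds <- st_coordinates(rattler)
rattler <- rattler[crds[, "X"] > -110 & crds[, "X"] < -60 &
                   crds[, "Y"] > 20 & crds[, "Y"] < 50, ]
```

Replotting after this step should show the Turkish record (and any other clear outliers) removed.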

Getting data 2: environmental data

There are a large number of available data sources for environmental data that can be used in species distribution models. We’ll take data from the WorldClim project (Hijmans et al. 2005), a collection of standardized climate data at a variety of spatial resolutions. These data can be downloaded using the getData() function, which provides access to this and other datasets. The data contain monthly averages of temperature and precipitation and a set of bioclimatic variables, which represent aggregate climate variables assumed to be linked to species distributions. I’ve already downloaded the bioclimatic variables for you and clipped them to the region you’re going to work in. These are available in two RData files containing modern (current.env.RData) and future (future.env.RData) climates for the study area.

load("../datafiles/current.env.RData")
load("../datafiles/future.env.RData")

The data are in R’s raster format, as a 3D array (a RasterStack) with one layer per variable. If you type the name of the objects that have been loaded, you see an overview of the data:

current.env
## class      : RasterStack 
## dimensions : 480, 720, 345600, 19  (nrow, ncol, ncell, nlayers)
## resolution : 0.04166667, 0.04166667  (x, y)
## extent     : -130, -100, 30, 50  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## names      :   bio1,   bio2,   bio3,   bio4,   bio5,   bio6,   bio7,   bio8,   bio9,  bio10,  bio11,  bio12,  bio13,  bio14,  bio15, ... 
## min values :   -7.1,    5.6,    2.4,  165.9,    6.0,  -24.1,   11.8,  -13.5,  -15.3,    0.5,  -15.9,   46.0,    7.0,    0.0,    5.0, ... 
## max values :   23.9,   21.3,    6.5, 1299.2,   45.6,   10.0,   50.9,   33.2,   32.9,   36.0,   15.4, 3345.0,  556.0,   77.0,  123.0, ...

This tells us that the files are 480 × 720 pixels, with 19 layers, each representing a single climate variable (the variables are listed in the appendix of today’s lab). There’s also information about the extent and resolution of the data. If you need to get different data or data for another region, look at the help for the getData() function.
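If you want to check exactly which variables are included, the raster package has simple accessors for the layer names and the layer count:

```r
## List the layer (variable) names and count the layers in the stack
names(current.env)
nlayers(current.env)
```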

You can extract any single layer using the raster() function (e.g. raster(current.env, 1) will give you the first layer). We can use this to visualize the data:

plot(raster(current.env, 1))
plot(st_geometry(pe), add = TRUE, pch = 21, bg = "olivedrab")

The RColorBrewer package provides several color palettes for spatial data. If you have this installed, you can change the colors as follows:

my.pal <- brewer.pal(9, "YlOrRd")
plot(raster(current.env, 1), col = my.pal)

Merging locations and environment

Finally, we assemble a dataset for modeling. We want to model the species distribution as a binary classification task, where 1 indicates presence and 0 indicates absence. However, we only have observations of presences (nobody records when they don’t see a species). To get around this, we’ll generate some ‘pseudo-absences’: locations that represent an absence of this species. There are several ways to generate these. One of the easiest is to simply pick random locations within the study area using the randomPoints() function. This takes as inputs:

  • A mask layer defining the study area
  • The number of selected points (we’ll get the same number as observed values)
  • The set of presence points

This generates a matrix of coordinates that we then need to convert to a spatial sf object. This is a bit long-winded, but allows us to easily plot the new points.

absence <- randomPoints(mask = raster(current.env, 1),  
                        n = nrow(pe), p = st_coordinates(pe))
absence <- st_as_sf(as.data.frame(absence), coords = c("x", "y"))

Now we can compare the distribution of presences and absences:

plot(raster(current.env, 1))
plot(st_geometry(pe), pch = 21, bg = "olivedrab", add = TRUE)
plot(st_geometry(absence), pch = 21, bg = "white", add = TRUE)

One problem with this method is that we may sample absences very far from the observed points. As the environment at these points is very unlike the places where the species is found, they will always predict as absences, and can bias our assessment of the model. A better approach is to restrict the absences to a small region around the observed presences. We could crop the environmental mask to a smaller region, or use a buffer approach. We’ll do the second of these here. We first generate 200 km buffers around each observed presence using the circles() function, then use the spsample() function to pick random points within these buffers:

x <- circles(st_coordinates(pe), d = 200000, lonlat = TRUE)
## Loading required namespace: rgeos
absence <- as(spsample(x@polygons, type = 'random', n = nrow(pe)), "sf")
## Warning in proj4string(obj): CRS object has comment, which is lost in output
plot(raster(current.env,1))
plot(st_geometry(pe), pch = 21, bg = "olivedrab", add = TRUE)
plot(st_geometry(absence), pch = 21, bg = "white", add = TRUE)

Now that we have the presence and absence points, we need to extract the environmental values for these points (these will be the features for machine learning). To do this we use the following steps:

  • Concatenate the presence and absence coordinates
  • Use the extract() function to get the associated climate values from the environmental grids
  • Create a binary (0/1) vector for use as labels
  • Combine the features and labels into a data frame
  • Reset the row names to numeric indices
pe.crds <- rbind(st_coordinates(pe), st_coordinates(absence))
pe.env <- raster::extract(current.env, pe.crds)
pe.pa <- as.factor(c(rep(1, nrow(pe)), rep(0, nrow(pe))))

pe.df <- data.frame(pe.crds, pa = pe.pa, pe.env)
row.names(pe.df) <- 1:nrow(pe.df)
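It’s worth quickly checking the result before modeling; a minimal sketch:

```r
## Quick sanity checks on the merged data frame
table(pe.df$pa)    # should show equal counts of 0 and 1
sum(is.na(pe.df))  # any missing environmental values?
```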

And having done all that, we can now move on to building our models.

Classification and regression trees

Classification and Regression Trees (CART) is a non-linear, non-parametric modeling approach that can be used with a wide variety of data. Regression trees are used with continuous outcome data, and classification trees with binary or categorical data, but the interface for these is the same in R.

R has a number of packages for performing CART and associated analyses. We’re going to use the tree package to demonstrate how this works with the California data, then we’ll move to using mlr3, which uses the rpart package. While the syntax will be different, there is little difference in the way these work.

Regression Trees

We will start by fitting a regression tree to the California housing data. The main function is tree(), and this will automatically choose a regression tree approach if the response variable is numerical (we’ll build a classification tree in the next section).

housing.tree <- tree(median_house_value ~ longitude + latitude,
                     data = housing2)

If you now type the name of the object containing the output, you will see the tree structure in text format.

housing.tree

Each line shows a split in the tree, with the condition, the number of observations that conform to that condition, the deviance (a derived loss function), and the mean value of median_house_value in that node. Splits that are indented represent subsequent partitions.

plot(housing.tree)
text(housing.tree)

The tree package has a function (partition.tree()) that allows you to visualize the output of a CART model in the original parameter space. As the only features we used in this model were the coordinates, we can use it here to overlay the partitions on our map of house values:

plot(housing2$longitude, housing2$latitude, col = grey(10:2/11)[cut.prices],
     pch = 20, xlab = "Longitude", ylab = "Latitude", asp = 1)
partition.tree(housing.tree, ordvars = c("longitude", "latitude"), add = TRUE)

In general this does a fairly good job of identifying the expected areas of higher values along the coast and around the bigger cities. As a comparison, we’ll now build a regression tree using the usual set of features in housing3:

housing.tree <- tree(median_house_value ~ .,
                     data = housing3)
plot(housing.tree)
text(housing.tree)
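Large trees like this can overfit. The tree package also supports cost-complexity pruning via cv.tree() and prune.tree(); a minimal sketch (the selected size depends on the cross-validation, so your result may differ):

```r
## Use cross-validation to pick a tree size, then prune back to it
housing.cv <- cv.tree(housing.tree)
plot(housing.cv)  # deviance against tree size
best.size <- housing.cv$size[which.min(housing.cv$dev)]
housing.pruned <- prune.tree(housing.tree, best = best.size)
plot(housing.pruned)
text(housing.pruned)
```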

Classification Trees

Next, we’ll build a classification model for the Pinus edulis data set using mlr3. CART methods are implemented in mlr3 using the rpart package and the two learners are classif.rpart for a classification tree and regr.rpart for a regression tree.

As a reminder, we need to identify the task, define a performance measure and a resampling strategy. Then we train the learner and evaluate the outcomes.

  • Set task
task_pe <- TaskClassif$new(id = "pe", backend = pe.df, 
                           target = "pa")

## Check the task details
task_pe$col_roles
## $feature
##  [1] "X"     "Y"     "bio1"  "bio10" "bio11" "bio12" "bio13" "bio14" "bio15"
## [10] "bio16" "bio17" "bio18" "bio19" "bio2"  "bio3"  "bio4"  "bio5"  "bio6" 
## [19] "bio7"  "bio8"  "bio9" 
## 
## $target
## [1] "pa"
## 
## $name
## character(0)
## 
## $order
## character(0)
## 
## $stratum
## character(0)
## 
## $group
## character(0)
## 
## $weight
## character(0)

The task is currently using all features including the coordinates (X and Y). We can exclude these from the feature list:

task_pe$col_roles$feature <- setdiff(task_pe$col_roles$feature,
                                     c("X", "Y"))
task_pe$feature_names
##  [1] "bio1"  "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17"
## [10] "bio18" "bio19" "bio2"  "bio3"  "bio4"  "bio5"  "bio6"  "bio7"  "bio8" 
## [19] "bio9"
  • Set performance measure
measure = msr("classif.auc")
  • Define learner (leave parameters at defaults)
lrn_ct = lrn("classif.rpart", predict_type = "prob")
  • Define the resampling strategy (we’ll use a simple hold-out with 20% of samples in the test set)
resamp_hout = rsmp("holdout", ratio = 0.8)
resamp_hout$instantiate(task_pe)

We then run the resampler and examine the performance

rr = resample(task_pe, lrn_ct, resamp_hout, store_models = TRUE)
rr$score(measure)
##                 task task_id                   learner    learner_id
## 1: <TaskClassif[45]>      pe <LearnerClassifRpart[32]> classif.rpart
##                 resampling resampling_id iteration              prediction
## 1: <ResamplingHoldout[19]>       holdout         1 <PredictionClassif[19]>
##    classif.auc
## 1:   0.7962489
rr$aggregate(measure)
## classif.auc 
##   0.7962489

With the default settings, this has worked fairly well, but we will next try to improve on this by tuning the hyperparameters. Before doing so, we can extract any of the trees that were built during the resampling. These are stored in a list in rr$learners, and you can select an individual model using R’s list index [[i]]. As we are using a hold-out, there is only one learner in this list ([[1]]). With a 5-fold cross-validation, there would be 5 ([[1]]...[[5]]).

rr$learners[[1]]$model
## n= 478 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 478 230 1 (0.48117155 0.51882845)  
##     2) bio17< 31.5 108  12 0 (0.88888889 0.11111111) *
##     3) bio17>=31.5 370 134 1 (0.36216216 0.63783784)  
##       6) bio1< 4.75 47   6 0 (0.87234043 0.12765957) *
##       7) bio1>=4.75 323  93 1 (0.28792570 0.71207430)  
##        14) bio5>=32.35 75  31 0 (0.58666667 0.41333333)  
##          28) bio17< 46.5 41   7 0 (0.82926829 0.17073171) *
##          29) bio17>=46.5 34  10 1 (0.29411765 0.70588235) *
##        15) bio5< 32.35 248  49 1 (0.19758065 0.80241935)  
##          30) bio12< 371 85  32 1 (0.37647059 0.62352941)  
##            60) bio10< 21.75 75  32 1 (0.42666667 0.57333333)  
##             120) bio8< 10.55 14   3 0 (0.78571429 0.21428571) *
##             121) bio8>=10.55 61  21 1 (0.34426230 0.65573770) *
##            61) bio10>=21.75 10   0 1 (0.00000000 1.00000000) *
##          31) bio12>=371 163  17 1 (0.10429448 0.89570552)  
##            62) bio13< 50.5 20   8 1 (0.40000000 0.60000000)  
##             124) bio2< 16 7   1 0 (0.85714286 0.14285714) *
##             125) bio2>=16 13   2 1 (0.15384615 0.84615385) *
##            63) bio13>=50.5 143   9 1 (0.06293706 0.93706294) *

We’ll use the rpart.plot package to visualize the first of these. This allows you to make a large number of tweaks to a tree plot. Here extra=106 adds the following to each terminal node: the predicted class, the proportion of 1s, and the percentage of observations in that node:

library(rpart.plot)
## Loading required package: rpart
prp(rr$learners[[1]]$model, extra = 106, roundint = FALSE)
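The vip package (loaded earlier) can give a quick view of which features drive the splits. A sketch, assuming we want the model from the first (and here only) resampling iteration; check the vip help if the arguments differ in your installed version:

```r
## Plot variable importance from the fitted rpart model
library(vip)
vip(rr$learners[[1]]$model, num_features = 10)
```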

Tuning

Classification and regression trees have a large number of hyperparameters and benefit from tuning. We can do this using the mlr3tuning package. There are several steps here:

1. Define the task, learner and measure

We’ll use the definitions from the previous section

2. Define the parameters to test

Parameter sets can be generated using the paradox package. This should have been installed along with mlr3, so load this now.

library(paradox)

Next check the available parameters for our learner

lrn_ct$param_set
## <ParamSet>
##                 id    class lower upper      levels        default value
##  1:       minsplit ParamInt     1   Inf                         20      
##  2:      minbucket ParamInt     1   Inf             <NoDefault[3]>      
##  3:             cp ParamDbl     0     1                       0.01      
##  4:     maxcompete ParamInt     0   Inf                          4      
##  5:   maxsurrogate ParamInt     0   Inf                          5      
##  6:       maxdepth ParamInt     1    30                         30      
##  7:   usesurrogate ParamInt     0     2                          2      
##  8: surrogatestyle ParamInt     0     1                          0      
##  9:           xval ParamInt     0   Inf                         10     0
## 10:     keep_model ParamLgl    NA    NA  TRUE,FALSE          FALSE

This table also gives the type of parameter (e.g. double precision or integer), the lower and upper bounds and the default value. We’ll test values for the complexity parameter (cp) between 0.0001 and 0.1, and for the minimum number of observations required to attempt a split (minsplit) from 2 to 20.

## Build parameter set
tune_ps = ParamSet$new(list(
  ParamDbl$new("cp", lower = 0.0001, upper = 0.1),
  ParamInt$new("minsplit", lower = 2, upper = 20)
))
tune_ps
## <ParamSet>
##          id    class lower upper levels        default value
## 1:       cp ParamDbl 1e-04   0.1        <NoDefault[3]>      
## 2: minsplit ParamInt 2e+00  20.0        <NoDefault[3]>

3. Define a stopping condition

Next we define one or more stopping criteria for the tuning. This is largely to ensure that tuning for highly complex algorithms is manageable. The available stopping criteria include:

  • limiting clock time
  • limiting the number of iterations or evaluations
  • stopping when a specific level of performance has been reached
  • stopping when performance no longer improves by more than some amount

We’ll use the second of these. The function to set the terminator is trm(), and we set the number of evaluations to 50.

evals = trm("evals", n_evals = 50)

4. Define the Tuner

Now, we set up a sampling strategy for searching among different hyperparameter values. There are a couple of options here; we will use a grid search, where the argument resolution gives the number of steps between the lower and upper bounds defined in our ParamSet.

tuner = tnr("grid_search", resolution = 10)

5. Run the tuner

The mlr3tuning package offers two ways to tune: either first running the tuning, then using the selected parameters to train the final model, or combining these steps using AutoTuner(). We’ll use the second of these here - it’s a little neater, and has one other advantage, as we will see later.

First create a new AutoTuner using the various other functions and parameters that we have just defined

at_ct = AutoTuner$new(learner = lrn_ct, 
                      resampling = rsmp("holdout"),
                      measure = measure, 
                      search_space = tune_ps,
                      terminator = evals,
                      tuner = tuner)

Note that we use a holdout method in the AutoTuner to split the data. This will be used to assess how the model’s skill changes as we vary the parameters. Each time it will evaluate the measure on the holdout test set. Whichever parameter set gives the best performance will then be automatically used to train a final model. The AutoTuner object inherits from the Learner methods we have previously seen, so to tune and train the model, just type:

at_ct$train(task_pe)

We can then see the results in at_ct$learner, showing a value of cp of 1e-04 and minsplit of 18.

at_ct$learner
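If you want the tuned values directly, rather than reading them off the printed learner, the AutoTuner also stores them; the $tuning_result field used here is from mlr3tuning, and the exact accessors can vary between package versions:

```r
## Best hyperparameter values found during tuning
at_ct$tuning_result
```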

Nested resampling

The previous code has trained a model, but it has not evaluated it. While this used a resampling strategy (the holdout), this is only used to select the best value of the hyperparameters. To evaluate the final trained model, we need to use an independent dataset. Fortunately, this is quite easy to set up using the AutoTuner.

To understand the following code, we need to define the inner vs. the outer resampling strategy.

  • The outer strategy divides the dataset into a training and testing data set, where the test set is used to evaluate the predictive skill of the model
  • The inner strategy takes the training data, and divides it into two new sets to tune the model - one set to train for each combination of parameters, and one set to evaluate and help select the best values of these parameters

Here, we’ll use a hold-out for the inner strategy, and a 3-fold cross validation for the outer.

resampling_inner = rsmp("holdout", ratio = 0.8)
resampling_outer = rsmp("cv", folds = 3)

Next we’ll remake the AutoTuner with the inner strategy:

at_ct = AutoTuner$new(learner = lrn_ct, 
                      resampling = resampling_inner,
                      measure = measure, 
                      search_space = tune_ps,
                      terminator = evals,
                      tuner = tuner)

And now run resample using the AutoTuner as the learner, and the outer resampling strategy (this will take a minute or so to run):

rr_ct = resample(task = task_pe, learner = at_ct, 
                 resampling = resampling_outer, store_models = TRUE)

This will take a little while to run; remember that this is dividing the original data set up three times, then for each one tuning the parameters across the parameter set/resolution. You can see the selected set of parameters for each of the outer cross-validation folds as follows:

sapply(rr_ct$learners, function(x) x$param_set$values)
##          [,1]   [,2]  [,3]  
## xval     0      0     0     
## cp       0.0445 1e-04 0.0112
## minsplit 14     18    10
rr_ct$score(measure) 
##                 task task_id         learner          learner_id
## 1: <TaskClassif[45]>      pe <AutoTuner[37]> classif.rpart.tuned
## 2: <TaskClassif[45]>      pe <AutoTuner[37]> classif.rpart.tuned
## 3: <TaskClassif[45]>      pe <AutoTuner[37]> classif.rpart.tuned
##            resampling resampling_id iteration              prediction
## 1: <ResamplingCV[19]>            cv         1 <PredictionClassif[19]>
## 2: <ResamplingCV[19]>            cv         2 <PredictionClassif[19]>
## 3: <ResamplingCV[19]>            cv         3 <PredictionClassif[19]>
##    classif.auc
## 1:   0.7781804
## 2:   0.8315096
## 3:   0.8179200
rr_ct$aggregate(measure)
## classif.auc 
##   0.8092033

We get a small improvement here, but not much, and in general, rpart doesn’t improve a lot with tuning. We’ll turn now to some more complex methods, to see if we can improve on these results.

Random forest

Next, we’ll try to build a random forest for the Pinus data. The mlr3 package uses a random forest routine from the ranger package that can run on multiple cores. To build this, all we need to do is create a new learner (classif.ranger), and run the hold-out resampling on it:

lrn_rf = lrn("classif.ranger", 
             predict_type = "prob", 
             importance = "permutation")
rr = resample(task_pe, lrn_rf, resamp_hout, store_models = TRUE)
rr$score(measure)
##                 task task_id                    learner     learner_id
## 1: <TaskClassif[45]>      pe <LearnerClassifRanger[32]> classif.ranger
##                 resampling resampling_id iteration              prediction
## 1: <ResamplingHoldout[19]>       holdout         1 <PredictionClassif[19]>
##    classif.auc
## 1:   0.8783745
rr$aggregate(measure)
## classif.auc 
##   0.8783745

The default random forest shows an improvement over the CART model. We’ll next try to tune this to see if we can improve its performance. First, let’s take a look at the set of available hyperparameters:

lrn_rf$param_set
## <ParamSet>
##                               id    class lower upper
##  1:                        alpha ParamDbl  -Inf   Inf
##  2:       always.split.variables ParamUty    NA    NA
##  3:                class.weights ParamDbl  -Inf   Inf
##  4:                      holdout ParamLgl    NA    NA
##  5:                   importance ParamFct    NA    NA
##  6:                   keep.inbag ParamLgl    NA    NA
##  7:                    max.depth ParamInt  -Inf   Inf
##  8:                min.node.size ParamInt     1   Inf
##  9:                     min.prop ParamDbl  -Inf   Inf
## 10:                      minprop ParamDbl  -Inf   Inf
## 11:                         mtry ParamInt     1   Inf
## 12:            num.random.splits ParamInt     1   Inf
## 13:                  num.threads ParamInt     1   Inf
## 14:                    num.trees ParamInt     1   Inf
## 15:                    oob.error ParamLgl    NA    NA
## 16:                  predict.all ParamLgl    NA    NA
## 17:        regularization.factor ParamUty    NA    NA
## 18:      regularization.usedepth ParamLgl    NA    NA
## 19:                      replace ParamLgl    NA    NA
## 20:    respect.unordered.factors ParamFct    NA    NA
## 21:              sample.fraction ParamDbl     0     1
## 22:                  save.memory ParamLgl    NA    NA
## 23: scale.permutation.importance ParamLgl    NA    NA
## 24:                    se.method ParamFct    NA    NA
## 25:                         seed ParamInt  -Inf   Inf
## 26:         split.select.weights ParamDbl     0     1
## 27:                    splitrule ParamFct    NA    NA
## 28:                      verbose ParamLgl    NA    NA
## 29:                 write.forest ParamLgl    NA    NA
##                               id    class lower upper
##                                           levels        default    parents
##  1:                                                         0.5           
##  2:                                              <NoDefault[3]>           
##  3:                                                                       
##  4:                                   TRUE,FALSE          FALSE           
##  5: none,impurity,impurity_corrected,permutation <NoDefault[3]>           
##  6:                                   TRUE,FALSE          FALSE           
##  7:                                                                       
##  8:                                                           1           
##  9:                                                         0.1           
## 10:                                                         0.1           
## 11:                                              <NoDefault[3]>           
## 12:                                                           1  splitrule
## 13:                                              <NoDefault[3]>           
## 14:                                                         500           
## 15:                                   TRUE,FALSE           TRUE           
## 16:                                   TRUE,FALSE          FALSE           
## 17:                                                           1           
## 18:                                   TRUE,FALSE          FALSE           
## 19:                                   TRUE,FALSE           TRUE           
## 20:                       ignore,order,partition         ignore           
## 21:                                              <NoDefault[3]>           
## 22:                                   TRUE,FALSE          FALSE           
## 23:                                   TRUE,FALSE          FALSE importance
## 24:                                 jack,infjack        infjack           
## 25:                                                                       
## 26:                                              <NoDefault[3]>           
## 27:                              gini,extratrees           gini           
## 28:                                   TRUE,FALSE           TRUE           
## 29:                                   TRUE,FALSE           TRUE           
##                                           levels        default    parents
##           value
##  1:            
##  2:            
##  3:            
##  4:            
##  5: permutation
##  6:            
##  7:            
##  8:            
##  9:            
## 10:            
## 11:            
## 12:            
## 13:            
## 14:            
## 15:            
## 16:            
## 17:            
## 18:            
## 19:            
## 20:            
## 21:            
## 22:            
## 23:            
## 24:            
## 25:            
## 26:            
## 27:            
## 28:            
## 29:            
##           value

The two most frequently tuned parameters are mtry (the number of variables used per split) and num.trees (the number of trees built). Next we define a parameter set for these:

## Build parameter set
tune_ps = ParamSet$new(list(
  ParamInt$new("mtry", lower = 1, upper = 8),
  ParamInt$new("num.trees", lower = 100, upper = 1000)
))

Then, we define a new AutoTuner. This is simply a copy of the previous one, but with an updated learner and parameter set. Note that we again set the inner resampling strategy.

at_rf = AutoTuner$new(learner = lrn_rf, 
                      resampling = resampling_inner,
                      measure = measure, 
                      search_space = tune_ps,
                      terminator = evals,
                      tuner = tuner)

Again, we use resample to tune the model within the outer resampling strategy:

rr_rf = resample(task = task_pe, learner = at_rf, 
                 resampling = resampling_outer, store_models = TRUE)

This will take a little while to run; remember that this splits the original data into three outer folds and, for each one, tunes the hyperparameters across the full search space.
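To make explicit what resample is doing with the AutoTuner, here is the nested structure sketched in plain base R. This is a schematic only, not the mlr3 internals; the data size and fold count are made up for illustration:

```r
# Nested resampling, schematically: the outer loop estimates performance,
# the inner loop (run within each outer training fold) picks hyperparameters.
set.seed(1)
n <- 90
outer_folds <- split(sample(1:n), rep(1:3, each = n / 3))
for (test_idx in outer_folds) {
  train_idx <- setdiff(1:n, test_idx)
  # Inner loop (sketched): for each candidate hyperparameter setting,
  # run CV on train_idx only and keep the setting with the best AUC;
  # then refit with that setting on all of train_idx and score on test_idx.
}
```

The key point is that the test fold of each outer split is never seen during tuning, so the three AUC scores are honest estimates of performance.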

rr_rf$score(measure) 
##                 task task_id         learner           learner_id
## 1: <TaskClassif[45]>      pe <AutoTuner[37]> classif.ranger.tuned
## 2: <TaskClassif[45]>      pe <AutoTuner[37]> classif.ranger.tuned
## 3: <TaskClassif[45]>      pe <AutoTuner[37]> classif.ranger.tuned
##            resampling resampling_id iteration              prediction
## 1: <ResamplingCV[19]>            cv         1 <PredictionClassif[19]>
## 2: <ResamplingCV[19]>            cv         2 <PredictionClassif[19]>
## 3: <ResamplingCV[19]>            cv         3 <PredictionClassif[19]>
##    classif.auc
## 1:   0.9098910
## 2:   0.8887654
## 3:   0.8996970
rr_rf$aggregate(measure)
## classif.auc 
##   0.8994511

And again, we see a small jump in the AUC with the newly tuned model. To see the selected parameter values:

sapply(rr_rf$learners, function(x) x$param_set$values)
##            [,1]          [,2]          [,3]         
## importance "permutation" "permutation" "permutation"
## mtry       2             6             1            
## num.trees  100           200           500

Variable importance plots

Next we’ll plot the permutation-based variable importance for this model. As a reminder, variable importance measures how much worse a model becomes when we scramble the values of one of the features. The model predicts the outcome for some test data (here the out-of-bag samples) twice: once with the original values of the feature and once with randomly shuffled values. If there is a large difference in the skill of the model, the feature is important in controlling the outcome. We’ll use the vip() function from the vip package to extract and then plot the variable importance scores. As there are three possible models from the resampling, we’ll just plot the first. The model object is buried quite deep in the resampling output, in rr_rf$learners[[1]]$model$learner$model:

vip(rr_rf$learners[[1]]$model$learner$model)

This indicates that bio10 (the mean temperature of the warmest quarter) is the most important feature controlling the distribution of the pine.
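The permutation idea is simple enough to sketch by hand. Below is a minimal, model-agnostic version in base R, using a linear model on the built-in mtcars data purely as a stand-in for the random forest: shuffle one feature, re-predict, and compare the error.

```r
# Minimal permutation-importance sketch (illustration only; vip() does
# this properly for the ranger model used in the lab).
set.seed(42)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
base_mse <- mean((mtcars$mpg - predict(fit, mtcars))^2)

perm_importance <- sapply(c("wt", "hp", "disp"), function(v) {
  shuffled <- mtcars
  shuffled[[v]] <- sample(shuffled[[v]])   # break the feature-outcome link
  mean((mtcars$mpg - predict(fit, shuffled))^2) - base_mse
})
sort(perm_importance, decreasing = TRUE)   # larger increase = more important
```

The larger the increase in error after shuffling, the more the model relied on that feature.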

Partial dependency plots

We can examine the form of the relationship between the occurrence of the pine and any individual feature using a partial dependency plot. This shows how the predicted outcome changes across the range of a feature, with all other features held constant. Here, we’ll use the partial() function from the pdp package to produce the plot. As arguments, this requires the model, the feature that you want the dependency on, and the set of data used to build the model. For a classification model, we can also specify which class to plot; in these data, the presences (1) are the second class.

partial(rr_rf$learners[[1]]$model$learner$model, 
        pred.var = "bio7", prob = TRUE, 
        train = pe.df, plot = TRUE, which.class = 2)

The plot shows the range over which this species is found, with a clear peak in suitability at about 36-37 degrees.
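Under the hood, a partial dependency curve is just an average of predictions with the feature of interest pinned to each value in a grid. A minimal base-R sketch, again using a linear model on mtcars as a stand-in:

```r
# Minimal partial-dependency sketch (illustration only; partial() from
# the pdp package does this for the lab's random forest).
fit <- lm(mpg ~ wt + hp, data = mtcars)
wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)

pd <- sapply(wt_grid, function(w) {
  tmp <- mtcars
  tmp$wt <- w               # fix the feature of interest...
  mean(predict(fit, tmp))   # ...and average over the other features
})
plot(wt_grid, pd, type = "l", xlab = "wt", ylab = "Partial dependence")
```

For a tree ensemble the same loop produces the stepped or curved responses you see in the lab's plot.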

Boosted regression trees

We will now build a boosted regression tree model for the Pinus data. In contrast to random forests, which build a set of individual weak trees in parallel, boosted regression trees (BRTs) start with a single weak tree and iteratively improve on it. Each new tree is fit to the residuals of the current ensemble, so successive trees target the part of the signal that has not yet been captured. While this makes these methods very powerful, they can easily overfit the data, so hyperparameter tuning is particularly important here.
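The fit-to-the-residuals loop can be written out in a few lines. Below is a toy regression example using rpart stumps on simulated data (a sketch of the idea only; xgboost implements this far more efficiently, with many refinements):

```r
# Toy gradient boosting: each stump is fit to the current residuals and
# its shrunken predictions are added to the ensemble.
library(rpart)
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
dat <- data.frame(x = x, y = y)

eta <- 0.1                     # learning rate (cf. eta in xgboost)
pred <- rep(mean(y), 200)      # start from a constant model
for (i in 1:200) {
  dat$resid <- y - pred        # residuals of the current ensemble
  stump <- rpart(resid ~ x, data = dat,
                 control = rpart.control(maxdepth = 1))
  pred <- pred + eta * predict(stump, dat)
}
mean((y - pred)^2)             # far below the variance of y alone
```

Note how the learning rate eta controls how much each tree contributes; small values need more rounds (nrounds) but overfit less readily, which is exactly the trade-off we tune below.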

We’ll follow the same steps as with the random forest. mlr3 uses the xgboost package, so we define our learner as classif.xgboost. The only hyperparameter that we will set here is subsample. This runs a stochastic boosting where each individual tree is built with a random subset of 50% of the training data.

lrn_brt = lrn("classif.xgboost", 
              predict_type = "prob", 
              subsample = 0.5)
rr = resample(task_pe, lrn_brt, resamp_hout, store_models = TRUE)
rr$score(measure)
##                 task task_id                     learner      learner_id
## 1: <TaskClassif[45]>      pe <LearnerClassifXgboost[31]> classif.xgboost
##                 resampling resampling_id iteration              prediction
## 1: <ResamplingHoldout[19]>       holdout         1 <PredictionClassif[19]>
##    classif.auc
## 1:   0.7594487
rr$aggregate(measure)
## classif.auc 
##   0.7594487

With the default settings, our BRT performs much worse than the random forest, and about the same as the initial CART model. We’ll next try to tune this to see if we can improve its performance. First, let’s take a look at the set of available hyperparameters:

lrn_brt$param_set

There is a long list of potential parameters to tune here. We’ll focus on three: eta (the learning rate), max_depth (the maximum depth of each tree) and nrounds (the number of boosting iterations). Next we define a parameter set for these:

tune_ps = ParamSet$new(list(
  ParamDbl$new("eta", lower = 0.001, upper = 0.1),
  ParamInt$new("max_depth", lower = 1, upper = 6), ## Raise upper limit
  ParamInt$new("nrounds", lower = 100, upper = 1000)
))

Then, we define a new AutoTuner. This is simply a copy of the previous one, but with an updated learner and parameter set. Note that we again set the inner resampling strategy.

at_brt = AutoTuner$new(learner = lrn_brt, 
                       resampling = resampling_inner,
                       measure = measure, 
                       search_space = tune_ps,
                       terminator = evals,
                       tuner = tuner)

Again, we use resample to tune the model within the outer resampling strategy. This may take a few minutes to run - remember we are dividing the data into 3 folds, then running 50 tuning iterations for each one.

rr_brt = resample(task = task_pe, learner = at_brt, 
                  resampling = resampling_outer, store_models = TRUE)
sapply(rr_brt$learners, function(x) x$param_set$values)
##           [,1]  [,2]  [,3] 
## verbose   0     0     0    
## nrounds   100   100   700  
## subsample 0.5   0.5   0.5  
## eta       0.067 0.034 0.001
## max_depth 6     3     4

You should see a much more substantial improvement in the AUC here:

rr_brt$score(measure) 
##                 task task_id         learner            learner_id
## 1: <TaskClassif[45]>      pe <AutoTuner[37]> classif.xgboost.tuned
## 2: <TaskClassif[45]>      pe <AutoTuner[37]> classif.xgboost.tuned
## 3: <TaskClassif[45]>      pe <AutoTuner[37]> classif.xgboost.tuned
##            resampling resampling_id iteration              prediction
## 1: <ResamplingCV[19]>            cv         1 <PredictionClassif[19]>
## 2: <ResamplingCV[19]>            cv         2 <PredictionClassif[19]>
## 3: <ResamplingCV[19]>            cv         3 <PredictionClassif[19]>
##    classif.auc
## 1:   0.8770032
## 2:   0.8907767
## 3:   0.8663636
rr_brt$aggregate(measure)
## classif.auc 
##   0.8780478

Variable importance plots

As before, we can use a variable importance plot to look at the contribution of the individual features to the model:

vip(rr_brt$learners[[1]]$model$learner$model)

Prediction

In this last section, we will produce predictive maps of suitability for the pine for the modern and future periods. The first step will be to make a final model based on the tuned hyperparameters and the full dataset. We’ll use the random forest here as an example. As a reminder, the results of the tuning were:

sapply(rr_rf$learners, function(x) x$param_set$values)
##            [,1]          [,2]          [,3]         
## importance "permutation" "permutation" "permutation"
## mtry       2             6             1            
## num.trees  100           200           500

We’ll take mtry = 1 and 100 trees from the tuned models for our final model and train it on the full dataset.

final_rf = lrn("classif.ranger", 
             predict_type = "prob", 
             importance = "permutation",
             mtry = 1,
             num.trees = 100)
final_rf$train(task_pe)

For prediction, we now need to extract the environmental data for every grid cell in our region into a dataframe. We also get a list of grid cell coordinates (this will help with plotting the results).

current.env.df <- as.data.frame(getValues(current.env))
grid.crds <- coordinates(current.env)

Next, we remove any grid cell with missing values (those over the oceans), together with the associated coordinates. Finally, we record the index of each remaining grid cell; we’ll use this to map the predictions.

naID <- which(is.na(current.env.df$bio1))
current.env.df <- current.env.df[-naID,]
grid.crds <- grid.crds[-naID,]
grid.idx <- cellFromXY(current.env, grid.crds)

We can now use this dataframe with the predict_newdata method to estimate the suitability.

pe.curr.pred <- final_rf$predict_newdata(current.env.df)

And finally, we use the grid cell indices to make a new raster layer showing the predictions. For this we need:

  • a Raster* object to be used as a template for the resolution and extent
  • the set of predicted values (here the probability of a ‘1’)
  • the grid cell indices

We make the template as a copy of one of the environmental layers, use the indices to assign the probabilities, then plot the result and overlay the original points:
pred.pal = brewer.pal(9, "Reds")
pe.curr.prob.r <- raster(current.env,1)
pe.curr.prob.r[grid.idx] <- pe.curr.pred$prob[,2]
plot(pe.curr.prob.r, col = pred.pal)
plot(st_geometry(pe), pch = 21, cex = 0.5, add = TRUE)
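The index bookkeeping here follows a general pattern: record the positions of the non-missing cells, predict only for those, then write the predictions back into an NA-filled template by position. A tiny base-R analogue (toy values, no raster objects involved):

```r
# Base-R analogue of the template-raster trick: drop NA cells, predict on
# the rest, then write predictions back into the full grid by cell index.
grid_vals <- c(0.2, NA, 0.5, NA, 0.9, 0.1)   # one env. variable over 6 cells
keep <- which(!is.na(grid_vals))             # cells we can predict for
pred <- grid_vals[keep] * 2                  # stand-in for model predictions

out <- rep(NA_real_, length(grid_vals))      # template filled with NA
out[keep] <- pred                            # write back by cell index
out                                          # NA cells (the 'oceans') stay NA
```

This is exactly what assigning into pe.curr.prob.r[grid.idx] does, with the raster template playing the role of out.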

This shows that the model captures the current distribution well, but also predicts a large area of suitability in the north-west. We can also plot the predicted class (0/1). This requires converting the prediction from R’s factor class to a numeric value:

pe.curr.class.r = raster(current.env,1)
tmp.pred <- as.numeric(pe.curr.pred$response) - 1
pe.curr.class.r[grid.idx] <- tmp.pred
plot(pe.curr.class.r)
plot(st_geometry(pe), pch = 21, cex = 0.5, add = TRUE)

Now let’s do the same with the future environment. First extract the values and coordinates.

future.env.df = as.data.frame(getValues(future.env))
grid.crds = coordinates(future.env)
names(future.env.df) <- names(current.env.df)

Now remove the missing values and set the grid index

naID = which(is.na(future.env.df$bio1))
future.env.df = future.env.df[-naID,]
grid.crds = grid.crds[-naID,]
grid.idx = cellFromXY(future.env, grid.crds)

Predict the suitability:

pe.rcp85.pred = final_rf$predict_newdata(future.env.df)

And plot it:

pe.rcp85.prob.r <- raster(future.env,1)
pe.rcp85.prob.r[grid.idx] <- pe.rcp85.pred$prob[,2]
plot(pe.rcp85.prob.r, col = pred.pal)
plot(st_geometry(pe), pch = 21, cex = 0.5, add = TRUE)

You should see here that the suitability is reduced in the south, but increased in the north, with warming temperatures. This is shown more clearly by looking at the predicted classes:

pe.rcp85.class.r <- raster(future.env,1)
tmp.pred <- as.numeric(pe.rcp85.pred$response) - 1
pe.rcp85.class.r[grid.idx] <- tmp.pred
plot(pe.rcp85.class.r)
plot(st_geometry(pe), pch = 21, cex = 0.5, add = TRUE)

And finally, we can illustrate the change in range:

my.pal <- brewer.pal(3,'PRGn')
plot(pe.rcp85.class.r - pe.curr.class.r, col=my.pal)

Exercise

In a previous lab, you used data from the Sonar.csv file to model types of object (rocks ‘R’ or mines ‘M’) using the values of a set of frequency bands. The exercise for this lab is to use one of the ensemble methods (random forest or boosted regression trees) to produce a new model of these data. You should use the mlr3 package, and you need to do the following:

  • Set up a task, learner, outer resampling strategy and performance measure
  • Use the AutoTuner to tune your model
  • Report the AUC score for the tuned model
  • Produce a variable importance plot

In addition, your answer should include your R code.

Appendix 1: Bioclimate variables

  • BIO1 = Annual Mean Temperature
  • BIO2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))
  • BIO3 = Isothermality (BIO2/BIO7) (* 100)
  • BIO4 = Temperature Seasonality (standard deviation *100)
  • BIO5 = Max Temperature of Warmest Month
  • BIO6 = Min Temperature of Coldest Month
  • BIO7 = Temperature Annual Range (BIO5-BIO6)
  • BIO8 = Mean Temperature of Wettest Quarter
  • BIO9 = Mean Temperature of Driest Quarter
  • BIO10 = Mean Temperature of Warmest Quarter
  • BIO11 = Mean Temperature of Coldest Quarter
  • BIO12 = Annual Precipitation
  • BIO13 = Precipitation of Wettest Month
  • BIO14 = Precipitation of Driest Month
  • BIO15 = Precipitation Seasonality (Coefficient of Variation)
  • BIO16 = Precipitation of Wettest Quarter
  • BIO17 = Precipitation of Driest Quarter
  • BIO18 = Precipitation of Warmest Quarter
  • BIO19 = Precipitation of Coldest Quarter